22 research outputs found

    Checkpointing algorithms and fault prediction

    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale. Comment: Supported in part by ANR Rescue. Published in Journal of Parallel and Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693

    On the complexity of scheduling checkpoints for computational workflows

    This paper deals with the complexity of scheduling computational workflows in the presence of Exponential failures. When such a failure occurs, rollback and recovery are used so that the execution can resume from the last checkpointed state. The goal is to minimize the expected execution time, and we have to decide in which order to execute the tasks, and whether to checkpoint or not after the completion of each given task. We show that this scheduling problem is strongly NP-complete, and propose a (polynomial-time) dynamic programming algorithm for the case where the application graph is a linear chain. These results lay the theoretical foundations of the problem, and constitute a prerequisite before discussing scheduling strategies for arbitrary DAGs of moldable tasks subject to general failure distributions.
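
The linear-chain case mentioned in this abstract can be illustrated with a small dynamic program. What follows is only a sketch under stated assumptions, not the authors' algorithm: it assumes the commonly used closed form for the expected time to complete `work` plus a checkpoint under Exponential failures of rate `lam`, namely e^(lam*R) * (1/lam + D) * (e^(lam*(W+C)) - 1); it assumes the last task is always checkpointed; and it assumes that restarting from the initial state costs no recovery. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple


def expected_time(work: float, ckpt: float, recovery: float,
                  downtime: float, lam: float) -> float:
    """Expected time to run `work` and then a checkpoint of cost `ckpt`
    under Exponential failures of rate `lam` (assumed closed form)."""
    return (math.exp(lam * recovery) * (1.0 / lam + downtime)
            * math.expm1(lam * (work + ckpt)))


def chain_checkpoints(weights: List[float], ckpt_costs: List[float],
                      rec_costs: List[float], downtime: float,
                      lam: float) -> Tuple[float, List[int]]:
    """Return (expected makespan, 1-indexed tasks followed by a checkpoint).

    best[i] is the minimal expected time to finish tasks 1..i with a
    checkpoint taken right after task i. Memoryless failures make the
    segments between checkpoints independent, so their costs add up.
    """
    n = len(weights)
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        work = 0.0
        for j in range(i - 1, -1, -1):
            work += weights[j]  # segment now covers tasks j+1..i
            # Assumption: recovering the initial state (j == 0) is free.
            rec = rec_costs[j - 1] if j > 0 else 0.0
            cand = best[j] + expected_time(work, ckpt_costs[i - 1],
                                           rec, downtime, lam)
            if cand < best[i]:
                best[i], prev[i] = cand, j
    # Walk back through the predecessor links to list checkpointed tasks.
    positions = []
    i = n
    while i > 0:
        positions.append(i)
        i = prev[i]
    return best[n], positions[::-1]
```

The double loop examines every pair of consecutive checkpoint positions, so the sketch runs in O(n^2) time for a chain of n tasks.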

    On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing

    Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has been recently advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replication in simulation using both synthetic and real-world failure traces so as to quantify average application makespan. One interesting result from these experiments is that, when process replication is used, application performance is not sensitive to the checkpointing period, provided that the period is within a large neighborhood of the optimal period. More generally, our empirical results make it possible to identify regimes in which process replication is beneficial.

    Cost-Optimal Execution of Trees of Boolean Operators with Shared Streams

    The processing of queries expressed as trees of boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors to a query processing device, such as a smartphone, over one or more network interfaces. Retrieving a data item incurs a cost, e.g., an energy expense that depletes the smartphone's battery. Since the query tree contains boolean operators, part of the tree can be short-circuited depending on the retrieved sensor data. An interesting problem is to determine the order in which predicates should be evaluated so as to minimize the expected query processing cost. This problem has been studied in previous work assuming that each data stream occurs in a single predicate. In this work we remove this assumption since it does not necessarily hold for real-world queries. Our main results are an optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including the one heuristic proposed in previous work for our general version of the query processing problem.

    Cost-Optimal Execution of Boolean Query Trees with Shared Streams

    The processing of queries expressed as trees of boolean operators applied to predicates on sensor data streams has several applications in mobile computing. Sensor data must be retrieved from the sensors, which incurs a cost, e.g., an energy expense that depletes the battery of a mobile query processing device. The objective is to determine the order in which predicates should be evaluated so as to shortcut part of the query evaluation and minimize the expected cost. This problem has been studied assuming that each data stream occurs at a single predicate. In this work we remove this assumption since it does not necessarily hold for real-world queries. Our main results are an optimal algorithm for single-level trees and a proof of NP-completeness for DNF trees. For DNF trees, however, we show that there is an optimal predicate evaluation order that corresponds to a depth-first traversal. This result provides inspiration for a class of heuristics. We show that one of these heuristics largely outperforms other sensible heuristics, including a heuristic proposed in previous work.

    Impact of fault prediction on checkpointing strategies

    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young and Daly in the presence of a fault prediction system, which is characterized by its recall and its precision, and which provides either exact or window-based time predictions. We derive the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results make it possible to analytically assess the key parameters that impact the performance of fault predictors at very large scale. In addition, the results of this analytical evaluation are corroborated by a comprehensive set of simulations, thereby demonstrating the validity of the model and the accuracy of the results. Comment: 20 pages
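
For readers unfamiliar with the baseline that this abstract extends, the classical first-order result of Young can be sketched as follows. This covers only the prediction-free case, not the paper's extended formulas. It assumes a platform mean time between failures `mtbf`, a checkpoint cost `checkpoint_cost`, and a first-order waste model that ignores downtime and recovery; the function names are illustrative.

```python
import math


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order optimal checkpointing period T = sqrt(2 * mu * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_waste(period: float, mtbf: float,
                      checkpoint_cost: float) -> float:
    """Approximate fraction of time wasted with a given period:
    C / T for taking checkpoints, plus T / (2 * mu) for the work
    lost, on average, when a failure strikes within a period."""
    return checkpoint_cost / period + period / (2.0 * mtbf)
```

Setting the derivative of C/T + T/(2*mu) with respect to T to zero gives T = sqrt(2*mu*C), which is why the two functions agree on the minimum.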

    Comments on "Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing"

    In this short note, we provide some comments on the recent paper "Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing" by Bouguerra et al. We start by identifying some errors in their equations. Then we explain that they do not actually use the distribution of lead times, contrary to statements by the authors. Finally, we show that their algorithm does not change policy at the best possible moment, and we point to our own work \cite{rr-journal-prediction} for the (correct version of the) optimal algorithm.

    Using group replication for resilience on exascale systems

    High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional fault-tolerance solution is checkpoint-recovery, by which the application saves its state to secondary storage throughout execution and recovers from the latest saved state in case of a failure. An oft-studied research question is that of the optimal checkpointing strategy: when should state be saved? Unfortunately, even using an optimal checkpointing strategy, the checkpointing frequency must increase as platform scale increases, leading to higher checkpointing overhead. This overhead precludes high parallel efficiency for large-scale platforms, thus mandating other, more scalable fault-tolerance mechanisms. One such mechanism is replication, which can be used in addition to checkpoint-recovery. Using replication, multiple processors perform the same computation so that a processor failure does not necessarily imply application failure. While at first glance replication may seem wasteful, it may be significantly more efficient than using solely checkpoint-recovery at large scale. In this work we investigate a simple approach where entire application instances are replicated. We provide a theoretical study of checkpoint-recovery with replication in terms of expected application execution time, under an exponential distribution of failures. We design dynamic-programming based algorithms to define checkpointing dates that work under any failure distribution. We also conduct simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems, and using failure logs from production clusters. Our results show that replication is useful in a variety of realistic application and checkpointing cost scenarios for future exascale platforms.